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By T. Tony Cai ^ and Lie Wang 
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We consider a wavelet thresholding approach to adaptive vari- 
ance function estimation in heteroscedastic nonparametric regression. 
A data-driven estimator is constructed by applying wavelet thresh- 
olding to the squared first-order differences of the observations. We 
show that the variance function estimator is nearly optimally adap- 
tive to the smoothness of both the mean and variance functions. The 
estimator is shown to achieve the optimal adaptive rate of conver- 
gence under the pointwise squared error simultaneously over a range 
of smoothness classes. The estimator is also adaptively within a loga- 
rithmic factor of the minimax risk under the global mean integrated 
squared error over a collection of spatially inhomogeneous function 
classes. Numerical implementation and simulation results are also 
discussed. 

1. Introduction. Variance function estimation in heteroscedastic non- 
parametric regression is important in many contexts. In addition to being 
of interest in its own riglit, variance function estimates are needed, for ex- 
ample, to construct confidence intervals/bands for the mean function and to 
compute weighted least squares estimates of the mean function. Relative to 
mean function estimation, the literature on variance function estimation is 
sparse. Hall and Carroll (1989) considered kernel estimators of the variance 
function based on the squared residuals from a rate optimal estimator of the 
mean function. Miiller and Stadtmiiller (1987, 1993) considered difference 
based kernel estimators of the variance function. Ruppert et al. (1997) and 
Fan and Yao (1998) estimated the variance function by using local poly- 
nomial smoothing of the squared residuals from an "optimal" estimator of 
the mean function. More recently, Wang, Brown, Cai and Levine (2008) 
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derived the minimax rate of convergence for variance function estimation 
and constructed minimax rate optimal kernel estimators. Brown and Levine 
(2007) proposed a class of difference-based kernel estimators and established 
asymptotic normality. 

So far the attention has been mainly focused on nonadaptive estimation 
of the variance function, that is, the smoothness of the variance function is 
assumed to be known and the estimators depend on the smoothness. In prac- 
tice, however, the smoothness of the underlying functions is nearly always 
unknown. It is thus important to construct estimators that automatically 
adapt to the smoothness of the mean and variance functions. This is the 
goal of the present paper. Specifically, we propose a wavelet thresholding 
approach to adaptive variance function estimation in the heteroscedastic 
nonparametric regression model 

(1) yi = f{xi) + V^^'^ixi)zi, i = l,...,n, 

where Xi = i/n and Zi are independent and identically distributed with zero 
mean and unit variance. Here n = 2'^ for some positive integer J. The pri- 
mary object of interest is the variance function V{x). The estimation accu- 
racy is measured both globally by the mean integrated squared error (MISE) 

(2) i?(t/,T/) = ^||y-T/||^ 

and locally by the mean squared error (MSE) at a given point G (0, 1) 

(3) R{V{x^),V{x^)) = E{V{x^) - V{x^)f. 

It is well known that when the mean function is sufficiently smooth, it has 
no first order effect on the quality of estimation for the variance function 
V; that is, one can estimate V with the same asymptotic risk as if / were 
known. See, for example, Ruppert et al. (1997) and Fan and Yao (1998). On 
the other hand, when / is not smooth, the difficulty in estimating V can 
be completely driven by the degree of smoothness of the mean /. How the 
smoothness of the unknown mean function influences the rate of convergence 
of the variance estimator can be characterized explicitly. Wang et al. (2008) 
showed that the minimax rate of convergence under both the pointwise MSE 
and global MISE is 

(4) max{n-4°,n-2/3/(2/3+i)} 

if / has a derivatives and V has /3 derivatives. 

The goal of the present paper is to estimate the variance function adap- 
tively without assuming the degree of smoothness for either the mean func- 
tion / or variance function V. We introduce a wavelet thresholding proce- 
dure which applies wavelet thresholding to the squared first-order differences 
of the observations in (1). The procedure has two main steps. The first step 
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is taking the square of the first-order diff'erences of the observations yi . This 
step turns the problem of variance function estimation under the model (1) 
into a more conventional regression problem of estimating the mean func- 
tion. Another motivation for taking the differences is to eliminate the effect 
of the mean function /. The second step is to apply a wavelet thresholding 
procedure to the squared differences. 

The procedure enjoys a high degree of adaptivity and spatial adaptivity 
in terms of the rate of convergence both for global and local estimation. 
More specifically, under the global risk measure (2), it adaptively achieves 
within a logarithmic factor of the minimax risk over a wide range of function 
classes which contain spatially inhomogeneous functions that may have, for 
example, jump discontinuities and high frequency oscillations. The estimator 
also optimally adapts to the local smoothness of the underlying function. As 
a special case, it is shown that the variance function estimator adaptively 
achieves the rate of convergence 



under both the pointwise MSE and global MISE, if / has a derivatives and V 
has (3 derivatives. Furthermore, it is shown that the extra logarithmic factor 
in the adaptive rate of convergence in (5) is necessary under the pointwise 
MSE and the estimator is thus optimally locally adaptive. 

The wavelet estimator of the variance function is data-driven and easily 
implementable. We implement the procedure in S-Plus and R and carry out a 
simulation study to investigate the numerical performance of the estimator. 
Simulation results show that the MISE mostly depends on the structure 
of the underlying variance function and the effect of the mean function is 
not significant. In addition, we also compare the performance of the wavelet 
estimator with that of a kernel estimator whose bandwidth is chosen by cross 
validation. The numerical results show that the wavelet estimator uniformly 
outperforms the kernel estimator. 

The paper is organized as follows. After Section 2.1 in which basic nota- 
tion and definitions are summarized, the wavelet thresholding procedure is 
introduced in Sections 2.2 and 2.3. Sections 3 and 4 investigate the theoreti- 
cal properties of the estimator. In particular. Section 4.1 derives a rate-sharp 
lower bound for the adaptive rate of convergence under the pointwise squared 
error loss. The lower and upper bounds together show that the estimator is 
optimally adaptive under the pointwise loss. Section 5 discusses implemen- 
tation of the procedure and presents the numerical results. The proofs are 
given in Section 6. 
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2. Wavelet procedure for variance function estimation. In this section 
we introduce a wavelet thresliolding procedure for estimating the variance 
function V under the heteroscedastic regression model (1). We begin with 
notation and definitions of wavelets and a brief introduction to wavelet 
thresholding for estimating the mean function in the standard Gaussian 
regression setting and then give a detailed description of our wavelet proce- 
dure for variance function estimation. 

2.1. Wavelet thresholding for Gaussian regression. We work with an or- 
thonormal wavelet basis generated by dilation and translation of a com- 
pactly supported mother wavelet tp and a father wavelet (p with / (j)= 1. A 
wavelet ip is called r -regular if ip has r vanishing moments and r contin- 
uous derivatives. A special family of compactly supported wavelets is the 
so-called Coiflets, constructed by Daubechies (1992), which can have arbi- 
trary number of vanishing moments for both (p and ip. Denote by W{D) the 
collection of Coiflets {(p,ip} of order D. So if {(p,ip} € W{D), then (p and 
ip are compactly supported and satisfy / x^(p{x) fix = for i = 1, . . . , Z) — 1; 
and / x^ip{x) = for i = 0, . . . , -D — 1. 

For simplicity in exposition, in the present paper we use periodized wavelet 
bases on [0, 1]. Let 

oo 

l=—oo 
oo 

<,fc(^)= E ^jA^-^) forte [0,1], 

l=~oo 

where (pj,k{x) = 2^/'^(p{2^x - k) and ipj^kix) = 2^/'^ip{2^x - k). The collection 
{(pij^ f^,k = 1, . . . ,2^'^;^^ f,,j > Jq > 0,k = 1, . . . ,2^} is then an orthonormal 
basis of L^[0,1], provided the primary resolution level jo is large enough 
to ensure that the support of the scaling functions and wavelets at level jo 
is not the whole of [0, 1] . The superscript "p" will be suppressed from the 
notation for convenience. An orthonormal wavelet basis has an associated 
orthogonal Discrete Wavelet Transform (DWT) which transforms sampled 
data into the wavelet coefficients. See Daubechies (1992) and Strang (1992) 
for further details about the wavelets and discrete wavelet transform. A 
square-integrable function g on [0, 1] can be expanded into a wavelet series: 

2n oo 2J 

(6) gix) = J2 ^jo,k(pjo,kix) ^j,kiJj,ki^)^ 

k=l j=jo k=l 

where = {g-,(pj,k) •>^j,k = {g-,'4'i,k) are the wavelet coefficients of g. 
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Wavelet thresholding methods have been well developed for nonparamet- 
ric function estimation, especially for estimating the mean function in the 
setting of homoscedastic Gaussian noise where one observes 

(7) yi = f(^^^+azi, Zi'~-Af(0,l), i = l,...,n. 

One of the best known wavelet thresholding procedures is Donoho- Johnstone's 
VisuShrink [Donoho and Johnstone (1994) and Donoho et al. (1995)]. Atyp- 
ical wavelet thresholding procedure has three steps: 

1. Transform the noisy data via the discrete wavelet transform. 

2. Threshold the empirical wavelet coefficients by "killing" coefficients of 
small magnitude and keeping the large coefficients. 

3. Estimate function / at the sample points by inverse discrete wavelet 
transform of the denoised wavelet coefficients. 

Many wavelet procedures are adaptive and easy to implement. We shall 
develop in Section 2.2 a wavelet procedure for variance function estimation 
where the noise is heteroscedastic and non-Gaussian. 



2.2. Adaptive wavelet procedure for estimating the variance function. We 
now give a detailed description of our wavelet thresholding procedure for 
variance function estimation. The procedure begins with taking squared dif- 
ference of the observations and then applies wavelet thresholding to obtain 
an estimator of the variance function. 

Set Di = ■^{y2i~i — y2i) for z = 1,2, . . . ,n/2. Then one can write 

Di = i=(/(rE2i-l) - f{x2i) + y^/2(x2.-l)^2i-l - V^'\x2i)z2i) 

(8) 



where 5i = f{x2i-i) - f{x2i), v}^'^ = ^J\{V[x2i-i) + V[x2i)) and 

(9) = {V{x2^.l) + V{x2^)r^'\V^'\x2^-l)z2i-l " V^'\x2^)z2i) 

has zero mean and unit variance. Then can be written as 

(10) A' = yi + + V2Vl'^5iEi + Vi{ej - 1). 

In the above expression Vi is what we wish to estimate, ^5f is a bias term 

caused by the mean function /, and y/^vl^"^ 5i£i + Vi(e? — 1) is viewed as 
the noise term. By taking squared differences, we have turned the problem 
of estimating the variance function into the problem of estimating the mean 
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function similar to the conventional Gaussian regression model (7). The dif- 
ferences are of course that the noise is non-Gaussian and heteroscedastic 
and that there are additional unknown deterministic errors. In principle, 
virtually any good nonparametric regression procedure for estimating the 
mean function can then be applied. In this paper we shall use a wavelet es- 
timator for its spatial adaptivity, asymptotic optimality, and computational 
efficiency. 

We now construct a wavelet thresholding estimator V based on the squared 
differences Df. Although the procedure is more complicated, the basic idea 
is similar to the one behind the VisuShrink estimator for homoscedastic 
Gaussian regression described at the end of Section 2.1. We first apply the 
discrete wavelet transform to D = ^/2/n{Di, . . . , D'^^^Y ■ Let d = W ■ D be 
the empirical wavelet coefficients, where W is the discrete wavelet transfor- 
mation matrix. Then d can be written as 

(11) d = {djg^i, . . . , dj^ 2m , djo,l^ ■ ■ ■ ^djo,2m , dj-2,1, ■ ■ ■ > 

where (ijo^fc E'-re the gross structure terms at the lowest resolution level, and 
dj,k U = JOt ■ ■ , J — = I, . . . ,2^) are empirical wavelet coefficients at level 
j which represent fine structure at scale 2^ . For convenience, we use {j, k) 
to denote the number 2^ + k. Then the empirical wavelet coefficients can be 
written as 

dj,k '^j,k ~l~ ^j,k 

where Zj^k denotes the transformed noise part, that is, 

^J-.fc = V^E W^^j^k)A^vl'''5iei + Viief - 1)) 

i 

and 

rj,k = Oj,k + E ^ihk),i\[^^f + 7i,fc- 

i 

Here Oj^k is the true wavelet coefficients of V[x), that is, Oj^k = {y,ipj,k), and 
7j^fc is the difference between 6j^k and the discrete wavelet coefficient of Vi, 

7j,k = J2 ^U,k),i^i - ^hk- 
i 

We shall see that the approximation error 7^ ^ is negligible. 

For the gross structure terms at the lowest resolution level, similarly, we 
can write 

djo,k = Tjo,k + Zjo,k, 
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where 



(12) rjo,fc = ^ Wi^jo,k),i\[^^i + ^jo,k + 7jo,k, 



(13) z,,,, = \/lT.Wuo,k),i{^^l^'^^^^ + - 1))' 



with = (y, and 7^0,^ = J2i W^(io,fc),i^i " Oo,fc- 

Note that the squared differences are independent and the variance 
ajf^ of the empirical wavelet coefficients dj^k can be calculated as follows: 



2 "''^ 

(14) 4k = Var(d,- fc) = - ^ W^g ^ Var(A'). 

j 

We shall use this formula to construct an estimator of cj|^ and then use it 
for choosing the threshold. 

For any y and t > 0, define the soft thresholding function r]t{y) = sgn(y)(|y| - 
t)-l_. Let Ji be the largest integer satisfying 2-^^ < J~^2^ . Then the 9j^k are 
estimated by 



0, otherwise, 



(15) e,,k 

where 

(16) Aj-fc = (7j,fcy'21og(n/2) 

is the threshold level, and aj f^ is an estimate of the standard deviation (Tj^k- 
We shall discuss estimation of aj ^ in Section 2.3. 

In wavelet regression for estimating the mean function, the coefficients 
of the father wavelets (pj^^k at the lowest resolution level are conventionally 
estimated by the corresponding empirical coefficients. Since there are only a 
small fixed number of coefficients, they would not affect the rate of conver- 
gence and the numerical results show that the wavelet estimators perform 
well in general. But in the setting of the present paper for estimating the 
variance function it turns out that using the empirical coefficients directly, 
although not affecting rate of convergence, often does not yield good nu- 
merical performance. We therefore estimate these coefficients by a shrinkage 
estimator given in Berger (1976). The estimator depends on the covariance 
matrix of the empirical coefficients dj^ = ((ij,),i, ■ . ■ ,dj^ 2ny which is a func- 
tion of the means and thus unknown. We shall use an estimated covariance 
matrix. Specifically, the estimator of is given by 

- f min{Z„ S-i(i,„,2Jo -2}S-i\ ~ 
17 e,o =[1 . ^ ~ dj, , 
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where = (Cjo,i' Oo,25 • • • ) Cjo,2Jo )' is the estimator and S is the estimated 
covariance matrix of dj^. In our problem, we set 

where is the father wavelets part of the discrete wavelet transform 
matrix W. That is, Wj^ is a 2-^o x ^ matrix and in our setting Wj^ consists 
of the first 2^^^ rows of W. Vd is a diagonal matrix given by 

Vd = Diag{V^2)^ v^i), . . . , Va'^Dj/^)} 

with Var(D?) given in equation (20) in Section 2.3. 

With Oj^k given in (15) and ij^^k in (17), the estimator of the variance 
function V is defined by 

2^0 Ji 2J 

(18) Ve{x) = ^ijo,k4>jo,k{x) + XI H ^j,k'4'j,k{x). 

k=l j=jo k=l 

So far we have only used half of the differences. Similarly we can apply the 
same procedure to the other half of differences, Z)^ = -^{y2i — y2i+i), and 

obtain another wavelet estimator Vo{x). The final estimator of the variance 
function V is then given by 

(19) V{x) = l{Ve{x) + Vo{x)). 

The variance function at the sample points y_ = {V{^) : i = 1, . . . ,n/2} 
can be estimated by the inverse transform of the denoised wavelet coeffi- 
cients: V = W-^ ■ (§)^/20_ 

Remark. In principle other thresholding techniques such as BlockJS 
[Cai (1999)], NeighBlock [Cai and Silverman (2001)] and EbayesThresh 
[Johnstone and Silverman (2005)] can also be adopted for estimating the 
wavelet coefficients here. However in the setting of the present paper the 
empirical coefficients are non-Gaussian, heteroscedastic and correlated and 
this makes the analysis of the properties of the resulting estimators very 
challenging. 

2.3. Estimate of Uj j^. Our wavelet estimator (19) of the variance func- 
tion V requires an estimate of variance a^f^ of the empirical wavelet coef- 
ficients. To make the wavelet estimator V perform well asymptotically, we 
need a positively biased estimate of o"?^. That is, the estimate cr?^. is greater 
than or equal to aj^ with large probability. This can be seen in the proof 
of our theoretical results. On the other hand, the estimate o"?^ should be of 
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the same order as fj?^. Combining the two requirements together we need 

an estimate cj?^ such that PiClj k^j k — ^jk — ^^'j k) ~^ ^ some constant 
C > 1. We shah construct such an estimate here. 

It is clear from equation (14) that an estimate of the variance of D 



2 

i ' 

i = l,2,...,n/2, yields an estimate of cr|;,. It follows from equation (10) 

that we need an estimate of -E(ef ), where = \/^vl^^Siei + Vi{ef — 1). Note 
that Df = Vi + \5f + ei where 5i is "small" and Vi is "smooth." We shah 
estimate Var(L'?) = E{ef) by the average squared differences of over an 
subinterval. Specifically, we define the estimator of Var(L'?) as follows. 

Let Ai = Dij-i - -^2i foi^ i = 1,2, . . . ,n/4. Fix < r < 1 and divide the 
indices 1,2,..., n/2 into nonoverlapping blocks of length [(§)''] . We estimate 
Var(Z)?) = E{ef) in each block by the same value. Let K be the total number 
of blocks and Bk be the set of indices in the kth block. For 1 < k < K , let 

(20) VI^CD?) ^ = , i^^rJ ,n T E for all ^ e Bk. 

(n/2)^(2- 1/logn) ^f-^^ 

Lemma 6 in Section 6 shows that this estimate has the desired property for 
any < r < 1. With this estimator of Var(D|), we estimate fj?^. by 

(21) ^l,. = -E^S>).Var(A^). 

i 

We shall use dj^k in the threshold Xj^k in (16) and construct the wavelet 
estimator of V as in (18) and (19). 

3. Global adaptivity of the wavelet procedure. We consider in this sec- 
tion the theoretical properties of the variance function estimator V given 
in (19) under the global MISE (2). The local adaptivity of the estimator 
under pointwise MSE (3) is treated in Section 4. These results show that 
the variance function estimator (19) is nearly optimally adaptive and spa- 
tially adaptive over a wide range of function spaces for both the mean and 
variance functions. 



3.1. Inhomogeneous function class 7i. To demonstrate the global adap- 
tivity of the variance function estimator V, we consider a family of large func- 
tion classes which contain spatially inhomogeneous functions that may have, 
for example, jump discontinuities and high frequency oscillations. These 
function classes are different from the more traditional smoothness classes. 
Functions in these classes can be viewed as the superposition of smooth func- 
tions with irregular perturbations. These and other similar function classes 
have been used in Hall, Kerkyacharian and Picard (1998, 1999) and Cai 
(2002) in the study of wavelet block thresholding estimators. 
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Definition 1. Let W = 'H(ai,a,7,Mi,M2,M3,L),t;), where < ai < 
Oi< D, 7 > 0, and Mi, M2, M^^v > 0, denote the class of functions / such 
that for any j > Jo > there exists a set of integers Aj with card(^j) < 
M2,2^'^ for which the following are true: 

• For each k £ Aj, there exist constants ao = f{2~^k),ai, . . . , a£)„i such that 
for all X £ [2^^k,2-i{k + v)], - Em=oam(x - 2-JA:)™| < Afi2-J"i. 

• For each k ^ Aj, there exist constants ao = f{2~^k),ai, . . . ,aD-i such that 
for ah X e [2-ik, 2-^{k + v)], - E^=oam(x - 2-JA;)™| < M22"J". 

A function / E TC{ai,a,^, Mi, M2, M3, D,v) can be regarded as the su- 
perposition of a regular function fg and an irregular perturbation t: f = 
fs + T. The perturbation r can be, for example, jump discontinuities or 
high frequency oscillations such as chirp and Doppler of the form: r(x) = 
J2k=iO'k{x — Xk)^'' cos{x — Xk)~'^^- The smooth function fg belongs to the 
conventional Besov class B^^{M2). Roughly speaking, a Besov space B^^^ 
contains functions having a bounded derivatives in space, the parameter 
q gives a finer gradation of smoothness. See Triebel (1983) and Meyer (1992) 
for more details on Besov spaces. 

Intuitively, the intervals with indices in Aj are "bad" intervals which 
contain less smooth parts of the function. The number of the "bad" in- 
tervals is controlled by M3 and 7 so that the irregular parts do not over- 
whelm the fundamental structure of the function. It is easy to see that 
7i{ai,a, 7, Mi, M2, M^, D, v) contains the Besov class B^^{M2) as a subset 
for any given ai, 7, Mi, M3, D and v. See Hall, Kerkyacharian and Picard 
(1999) for further discussions on the function classes 7i. 

3.2. Global adaptivity. The minimax rate of convergence for estimating 
the variance function V over the traditional Lipschitz balls was derived in 
Wang et al. (2008). Define the Lipschitz ball A"(M) in the usual way as 

A''{M) = {g:\g^''\x)\ <M, and |5^L"J)(a;) _ ^(N)(y)| <M|x-y|"' 

for aU < x, y < 1, A; = 0, . . . , [aJ - 1} 

where [aJ is the largest integer less than a and a' = a — [a\ . Wang et al. 
(2008) showed that the minimax risks for estimating V over / G A"(Mj) 
and V G A^(Mv) under both the global MISE and the MSE at a fixed point 
x^ € (0,1) satisfy 

inf sup E\\V-V\\l 

V /GA"(Af/),FeA'3(Mv) 

(22) xinf sup E{V{x^)-V{x^)f 

V /eA"(Af/),VeA'9(Afv) 



xmax{n-4-,n-2/5/{/^+i)}. 
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We now consider the function class 7i{ai,a,'y, Mi, M2, M3, D,v) defined in 
Section 3.1. Let TCf{a) =7i{ai,a,^f,Mfi,Mf2,Mf3,Df,Vf) and7iv{f3) = 
n{Pi , p, -fv,Mvi , Mv2,Mv3, Dv,vv). Since n{ai , a, 7, Mi , M2, M3, D, v) con- 
tains tlie Lipschitz ball A"(M2) as a subset for any given ai, 7, Mi, M3, 
D, and v, a minimax lower bound for estimating V over / G 7if{a) and 
V £ HviP) follows directly from (22): 

(23) inf sup E||1/-y||^>C7-max{n-^",n-2/5/(i+2/3)}. 

V fe'Hf{a)yenvm 

The following theorem shows that the variance function estimator V is 
adaptive over a range of the function classes 7i. We shall assume that the 
error Zi in the regression model (1) satisfies the property that the moment 
generating function of zf, G{x) = E{e^^i), exists when |x| < p for some 
constant p> 0. This condition implies that the moment generating function 
of £i in (9), Ge{x) = E{e^^'), exists in a neighborhood of 0. 

Theorem 1. Let {yi, . . . ,yn} be given as in (1). Suppose the wavelets 
{(/>,'?/'} G W{D) and the moment generating function of zf exists in a neigh- 
borhood of the origin. Suppose also 7j < 1 -|-4ai — 4a, and < ^1+213 ■ ^^^^ 
the variance function estimator V given in (19) satisfies that for some con- 
stant Co > and all < (3 < D 



Remark. The use of Coiflets in Theorem 1 is purely for technical rea- 
sons. If the following mild local Lipschitz condition is imposed on functions 
in 7i in regions where the functions are relatively smooth, then the Coiflets 
are not needed. Local adaptivity result given in the next section does not re- 
quire the use of Coiflets and our simulation shows no particular advantages 
of using Coiflets in the finite sample case. 



(i) If a > 1 > Qi, then for k ^ Aj, \f{x) - f{2-^k)\ < M42-J', for x e 
[2~^k,2~^{k + v)]. 



(ii) If a > ai > 1, then |/(x) - f{2-^k)\ < Mi2-^ , for x G [2"^ fc, 2-J(fc + 
v)]. 



Comparing (24) with the minimax lower bound given in (23), the estima- 
tor V is adaptively within a logarithmic factor of the minimax risk under 
global MISE. Thus, the variance function estimator V, without knowing 
the a priori degree or amount of smoothness of the underlying mean and 
variance functions, achieves within a logarithmic factor of the true optimal 
convergence rate that one could achieve by knowing the regularity. 



(24) 



sup 
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For adaptive estimation of V over the traditional Lipschitz balls, the 
following is a direct consequence of Theorem 1. 



Corollary 1. Let {yi, ■ ■ ■ ,yn} be given as in (1). Suppose the wavelet 
ip is r -regular and the moment generating function of zf exists in a neigh- 
borhood of the origin. Then the variance function estimator V given in (19) 
satisfies that for some constant Cq > and all < (3 <r 

(25) sup ^||1/-y|||<Co-max(n- 




4. Local adaptivity. For functions of spatial inhomogeneity, the local 
smoothness of the functions varies significantly from point to point and 
global risk measures such as (2) cannot wholly reflect the performance of an 
estimator locally. The local risk measure (3) is more appropriate for mea- 
suring the spatial adaptivity, where E (0, 1) is any fixed point of interest. 

Define the local Lipschitz class A"(M, a;*,(^) by 

A"(M,x„5) = {5: |5^L"J)(^) _^(N)(a;^)| < m\x - x.^' , 

X e {x^ — 6,x^ + 6)} 
where [a\ is the largest integer less than a and a' = a — [a\ . 



Theorem 2. Let {yi, . . . ,yn} be given as in (1). Suppose the wavelet tp is 
r-regular and the moment generating function of zf exists in a neighborhood 
of the origin. Then the variance function estimator V given in (19) satisfies 
that for any fixed x.^ £ (0,1) there exists some constant Cq > such that for 
all a> and all < f3 < r 

sup E{V{x^)-V{x^)f 

f(^A°'{Mf,x,,Sf),Vf^Af(Mv,X:,,Sv) 

(26) 

< Co • max< n 



n 



1- 



Comparing (26) with the minimax rate of convergence given in (22), the 
estimator V is simultaneously within a logarithmic factor of the minimax 
risk under the pointwise risk. We shall show in Section 4.1 that, under the 
pointwise risk, this logarithmic factor is unavoidable for adaptive estima- 
tion. It is the minimum penalty for not knowing the smoothness of the 
variance function V. Therefore the estimator V is optimally adaptive under 
the pointwise loss. 
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4.1. Lower bound for adaptive pointwise estimation. We now turn to the 
lower bound for adaptive estimation of the variance function V under the 
pointwise MSE. The sharp lower bound we derive below demonstrates that 
the cost of adaptation for variance function estimation behaves in a more 
complicated way than that for mean function estimation. 

It is well known in estimating the mean function / that it is possible 
to achieve complete adaptation for free under the global MISE in terms of 
the rate of convergence over a collection of function classes. That is, one 
can do as well when the degree of smoothness is unknown as one could 
do if the degree of smoothness is known. But for estimation at a point, 
one must pay a price for adaptation. The optimal rate of convergence for 
estimating the mean function / at point over A"(Mj) with a completely 
known is 72-20/(1+20!)^ ^.j^^ setting of adaptive estimation of the mean 
function, Lepski (1990) and Brown and Low (1996) showed that one has to 
pay a price for adaptation of at least a logarithmic factor even when a is 
known to be one of two values. It is shown that the best achievable rate is 
|-lo£n'j2a/(i+2«) ^ ^j^^^ ^j^^ smoothness parameter a is unknown. 

Here we consider adaptive estimation of the variance function V at a 
point. The following lower bound characterizes the cost of adaptation for 
such a problem. 

Theorem 3. Let ao,ai > 0, (3o > Pi > and 4ao > j^^- Under the 

regression model (1) with Zi A^(0, 1), for any estimator V and any fixed 
X* e (0,1), if 

liE min{n^"«,n2^o/(^+2*)} 

(27) 



X sup E{V{x^:) -V{x^))'^ <oo 

/eA"o (Mf),V£Af'o (My) 

2/3i/(l+2/3i) 



then 



(28) 



lim min< n"^"^, 
?i— +00 



n 



logn 



X sup 

/6A"i(M/),yeAft (Mv) 



The lower bound for adaptive estimation given in Theorem 3 is more com- 
plicated than the corresponding lower bound for estimating the mean func- 
tion given in Lepski (1990) and Brown and Low (1996). Theorem 3 shows 
that, if an estimator is rate optimal for / E A""(Mj) and V G A^'-''{My), then 
one must pay a price of at least a logarithmic factor for / G A"^{Mf) and 
V £ A^i(Mv) if the mean function is smooth, that is, 4ai > j^f^- On the 
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other hand, if 4ai < ji^^^^, then it is possible to achieve the exact minimax 
rate simultaneously both over / G A"" (M/) , V £ A* (My) and / G A"i (M/) , 
V G A'^i(My). In contrast, for estimating the mean function at a point one 
must always pay a price of at least a logarithmic factor for not knowing the 
exact smoothness of the function. 

Comparing the lower bound (28) with the upper bound (26) given in 
Theorem 2, it is clear that our wavelet estimator (19) is optimally adaptive 
under the pointwise risk. The lower and upper bounds together show the 
following. When the mean function is not smooth, that is, 4a < the 
minimax rate of convergence can be achieved adaptively. On the other hand, 
when the effect of the mean function is negligible, that is, Aa > j^^, the 
minimax rate of convergence cannot be achieved adaptively and one has 
to pay a minimum of a logarithmic factor as in the case of mean function 
estimation. 

The proof of this theorem can be naturally divided into two parts. The 
first part 

(29) lim • sup E{V{x^) - V{x^)f > 

n^oo f(^A^i{Mf),VeAl^i{Mv) 

follows directly from the minimax lower bound given in Wang et al. (2008). 
We shall use a two-point constrained risk inequality to prove the second 
part, 

/ ^ N 2/3i/(l+2/3i) 

(30) lim • sup E{Vix^)-V{x^)y >0. 

n-*oo\lOgnJ f£A°'i{Mf)y£A^iiMv) 

A detailed proof is given in Section 6.4. 

5. Numerical results. The adaptive procedure for estimating the vari- 
ance function introduced in Section 2.2 is easily implementable. We im- 
plement the procedure in S-Plus and R. In this section we will investigate 
the numerical performance of the estimator. The numerical study has three 
goals. The first is to investigate the effect of the mean function on the esti- 
mation of the variance function. Several different combinations of the mean 
and variance functions are used and the MSE of each case is given. The 
second goal is to study the effect of different choices of r in (20) on the per- 
formance of the estimator. The simulation results indicate that the MISE 
of the estimator is not sensitive to the choice of r. Finally, we will make a 
comparison between the wavelet estimator and a kernel estimator with the 
bandwidth chosen by cross validation. For reasons of space, we only report 
here a summary of the numerical results. See Cai and Wang (2007) for more 
detailed and additional simulation results. 
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Four different variance functions were considered in the simulation study. 
They are Bumps and Doppler functions from Donoho and Johnstone (1994) 
and also the following two functions: 



These test functions are rescaled in the simulations to have the same L2 
norm. 

We begin by considering the effect of the mean function on the estima- 
tion of the variance function. For each variance function V{x), we use five 
different mean functions, the constant function /(x) = 0, the trigonomet- 
ric function / = sin(20x), and Bumps, Blocks and Doppler functions from 
Donoho and Johnstone (1994). Different combinations of wavelets and sam- 
ple size n yield basically the same qualitative results. As an illustration, 
Table 1 reports the average squared errors over 500 replications with sample 
size n = 4096 using Daubechies compactly supported wavelet Symnilet 8. In 
this part, we use r = 0.5 in (21). Figure 1 provides a graphical comparison 
of the variance function estimators and the true functions in the case the 
mean function / = 0. 

It can be seen from Table 1 that in all these examples the MISEs mostly 
depend on the structure of the variance function. The effect of the mean 
function / is not significant. For Bumps and Blocks, the spatial structure 
of the mean / only affect a small number of wavelet coefficients, and the 
variance function estimator still performs well. But still, when / is smooth, 
the estimator of the variance function V is slightly more accurate. We can 
also see that the results here are not as good as the estimation of mean 
function under the standard homoscedastic Gaussian regression model. This 
is primarily due to the difficulty of the variance function estimation problem 
itself. 

Table 1 

The average squared error over 500 replications with sample size n = 4096 



' 3 - 30x 
20a; -1 



for < X < 0.1, 




for 0.1 <x< 0.25, 
for 0.25 <x< 0.725, 
for 0.725 < X < 0.89 
for 0.89 < X < 1, 




f{x) = sin(20a;) 



Bumps 



Blocks 



Doppler 



V2{X) 

Bumps 



0.0817 
0.0523 
0.1949 
0.4162 



0.0842 
0.0553 
0.2062 
0.5037 



0.0825 
0.0557 
0.2146 
0.4817 



0.0860 
0.0563 
0.2133 
0.4888 



0.0837 
0.0567 
0.2060 
0.4902 



Doppler 
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We now turn to the choice of r in (21). Using the same setting as in the 
previous example, we apply our procedure for the four test functions with 
three different choices of r in (21), r = 0.2,0.5 and 0.8, respectively. The 
mean function is chosen to be / = 0. The average squared error over 500 
replications are given in Table 2. 

For each test function the MISEs are nearly identical for different choices 
of r. It is thus clear from Table 2 that the performance of the estimator is 
not sensitive to the choice of r. We suggest use r = 0.5 in practice. 

After taking squared differences, the problem of estimating the variance 
function becomes the problem of estimating the mean function and virtually 
any good procedure for estimating the mean function can then be applied. 
We now compare the performance of our wavelet estimator with a kernel 



Table 2 

The MISEs for different choices of r 





Viix) 


V2{x) 


Bumps 


Doppler 


r = 0.2 


0.0838 


0.0581 


0.1981 


0.4852 


r = 0.5 


0.0817 


0.0523 


0.1949 


0.4162 


r = 0.8 


0.0859 


0.0532 


0.2065 


0.4335 
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Table 3 

Comparison of the MISEs for the wavelet and kernel estimators 









Bumps 


Doppler 


Wavelet 


0.0817 


0.0523 


0.1949 


0.4762 


Kernel 


0.1208 


0.0631 


0.2296 


0.5463 



estimator whose bandwidth is chosen via cross-vahdation. Table 3 displays 
the average squared errors over 500 replications of the two estimators for 
the four variance functions with the mean function / = 0. 

The wavelet estimator outperforms the kernel estimator for all the vari- 
ance functions. The MISEs of the kernel estimator are 14% to 47% higher 
than the corresponding wavelet estimator. Although the bandwidth of the 
kernel estimator is chosen adaptive via cross-validation, the spatial inho- 
mogeneity of the variance functions limits the performance of any kernel 
method with a single bandwidth. 

In summary, the simulation study shows that the effect of the mean func- 
tion on the performance of the wavelet estimator is not significant. In this 
sense our wavelet procedure is robust against the mean function interference. 
The procedure is also not sensitive to the choice of r. In addition, the wavelet 
estimator uniformly outperforms the kernel estimator whose bandwidth is 
chosen by cross-validation. 

6. Proofs. We begin by introducing and proving several technical lem- 
mas in Section 6.1 that will be used in the proof of the main results. Through- 
out this section, we use C (as well as Co, Ci, etc.) to denote constants that 
may vary from place to place. 

6.1. Preparatory results. Oracle inequality for the soft thresholding es- 
timator was given in Donoho and Johnstone (1994) in the case when the 
noise is i.i.d. normal. In the present paper we need the following risk bound 
for the soft thresholding estimator without the normality assumption. This 
risk bound is useful in turning the analysis of the variance function estima- 
tor into the bias-variance trade-off calculation which is often used in more 
standard Gaussian nonparametric regression. 

Lemma 1. Let y = 9 + Z , where 9 is an unknown parameter and Z is a 
random variable with EZ = 0. Then 

E{ri{y, A) - 9f < 9^ A (AX"^) + 2E{Z^I{\Z\ > A)). 

Proof. Note that 

E{rj{y, A) - 9f < 2E{r]{y, A) - yf + 2E{y - 9f < 2X^ + 2EZ^ 
<A\^ + 2E{Z'^I{\Z\ > A)). 
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On the other hand, 

E{r]{y, A) - Of = 9^P{-X -e<z <\-e) + E{{Z - xfi{z >\-e)) 

+ E{{z + \fi{z <-\-e)) 

<e'^ + E{{Z - \fl{Z > A)) + E{{Z + \fl{Z < -A)) 

<e'^ + E{z^i{\z\> X)). □ 

The following lemma bounds the wavelet coefficients of the functions in 

n. 

Lemma 2. (i) Let g G 7i{ai,a,^,Mi,M2,M3,D,v). Assume the wavelets 
G W{D) with supp((/>) = supp('i/') C [0,i;]. Let n = 2'^ , = /^J,^ 
and 9j^k = J gipj^k- Then 

\ij,k-n-^'''g{k/n)\<Mi\\(t>hn~'^^l^+^^^ for all k e Aj; 

\L,k - n-^'^g{k/n)\ < M2||0||in-(i/2+") /^r all k i Aj- 

\Oj,k\ < Mi||V'||i2-J'(^/2+ai) k^Aj; 

\dj,k\ < Mi||V'||i2--''(i/2+a) 

(ii) For all functions g G A"(M), the wavelet coefficients of g satisfy \ 
dj,k |< C2~-'^"^/^^") where constant C depends only on the wavelets, a and 
M only. 

Lemma 2(ii) is a standard result; see for example, Daubechies (1992). For 
a proof of Lemma 2(i), see Hall, Kerkyacharian and Picard (1999) and Cai 
(2002). It follows from this lemma that 



(31) sup Y.[Uk-n-'''g(-]] <C7n-(2/^-i). 



n 



The next lemma gives a large deviation result, which will be used to 
control the tail probability of the empirical wavelet coefficients. 

Lemma 3. Suppose Si, i = l,2,..., are independent random variables 
with Eei = 0, Var(ej) = Vi < vq for all i. Moreover, suppose the moment 
generating function M^{x) = E{exp{xei)) exists when \x\ < p for some p> 
and all i. Let 

^ m 
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with J2i^i(^rni ~ ^ '^'^^ \0'mi\ ^ cq/v^ /^^^ soiTie Constant cq, then for X = 
o{m~^/^) and sufficiently large m 

Pi\Z„^\>amX) ^ A3 
< exp 



2(1 -$(A)) 

where a'^ = J^o-mi'^i/^o o,f^d C > is a constant. 

Proof. Let Sm = m}/'^Zm. and Mm{x) = E{ex.p{xSm))- Suppose /ijfc 
denote the kth moment of £i. Note that \ami\ < com~^^^ for any m and 
1 < i < m, we have 

log(M„,(x)) 1 ^ ■/ Omim^^^ 



^1/2 m^/^ ^ \ ^vq 



X- 



1 m . ,Til/2„2 rr73/4„3 



i=l 

where Am(2;) is uniformly bounded for all m when x < p. This means that 
Mm{x) can be written in the form M^(x) = e'"'^'(^V2«o)(i + 0(^-^4)) 
for |x| < p. It then follows from Theorem 1 of Hwang (1996) that for A = 
o{m~^/^) and sufficiently large m 

2(i-<i>(A)) -^^pi,^,„i/4;^^+^^"^ ))■ □ 

A special case of Lemma 3 is when m > (logn)^' for a positive integer n 
and some k> 2 and A = ^/2Togn. In this case, we have 

Pi\Zm\>amV2W^) / (21ogn)3/2 w / V2b^ \\ 

2(1 - $(V2bi7^)) -''''^l (logn)'^/4A V(logn)^'/4yy- 

Since k > 2, exp{(logn)3/^~'^/^} = o(n") for any a > as n — > oo. Therefore 
(32) P{\Z„^\>a„^^/2\ogn)<0^ ^ 



for any a > 0. 

The following two lemmas bounds the difference between the mean Tj^k 
of the empirical wavelet coefficient dj^k and the true wavelet coefficient Oj^k, 
globally and individually. 
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Lemma 4. Using the notation in Section 2.2, and under the conditions 
of Theorem 1, we have 



sup 

/6W/{Q),VeWv(/3) 



{ k j,k ) 



= (9(^-(lA4aA2/3A{l+2/3i-7y))-, 

Proof. Note that 



E(^io,fc - Cjo,k? + Y.(Hk - Oj,k? = E (e w^j,k),i]l^sf + 7,,fc] 

k j,k j,k \ i / 

<2EfE^o>).\/J^.^)' + 2E7l.. 

j,k \ i / j,k 

It follows from the isometry property of the orthogonal wavelet transform 
that 



From the definition of the function class if 2z — 1 G Aj then 5i < Cn~(^^"i) 
for some constant C > 0; if 2i — 1 ^ Aj, 5i < Cn~^^^"'^ for some constant 
C > 0. This means 



n n 



On the other hand, 

2E7I. = 2E (E Wu,k)A^^ - Vi2z - 1)) 

j,k j,k \ i 

+ E^o»,*(^(2^-i)-M') 
<4EfE^ow(^^-^(2^-i))) 

j,k \ i / 
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j,k \ i J 
i j,k \ i / 

Similarly to the previous calculation, we have 

^(y. _ v{2i - l)f = 0(n^^-i-2(i^/^i) + n-''^''^^^). 

i 

It follows from Lemma 2 that 

j,k \ i / 
k 

= E i(j/2,k-V{2i-l)f+ (.0/2,k-V{2i-l)f 

The lemma is proved by putting these together. □ 

Lemma 5. Using the notation in Section 2.2, for any S (0, 1), 

sup E(^io,fc - Go,fc)'Ajo,fc(^*) 

f&K'^{Mf,x,,&)y&KP{Mv,x,,5)\ 



= C)(„-(4aA2/3Al))_ 

Proof. It follows from the property of DWT that, 



\ k j,k 



2 



V 

Note that (p{x) has compact support, say supp((/)) C [—L,L]. So 

0) only if ^ (2;* ~ IT'^* + This means in the previous summation 
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only need to consider those i's for which ^ G {x^ — ^,x^, + ^). For those 
i, supp(i;^j_i^j) C (x* — (5, X* +6) for ah sufficiently large n. On the interval 
(x* — (5, X* + 5), both /(x) and V{x) has Lipschitz property and the lemma 
now follows from (31). □ 

Lemma 6 below shows that the estimator Var(Z>?) given in (20) has the 
desired property of being slightly positively biased. 

Lemma 6. Suppose V{x) is bounded away from zero and Zi 's are i.i.d. 
random variables. Suppose af, is the estimator mentioned in Section 2.3. 
Then for any m> there exist constants Cm > such that 

p(f] n {E{e^)<^l<me^)))>l-Cmn-"^. 

\ k isblock k / 



1\/3A1 /ix2{aAl) 

EA, = V^,_,-V^,<2Mv{-) +MA-\ 



Proof. Let Uk denote the kth moment of Zi. It is easy to see that 

■)' 

EAj = {V^,^, - Vi,f + E{el^,) + E{el). 
Since E{ef) = V^{ui - 1) + 25fVi + 2^/2V^^ SiU^,, we know that 

E{e^)-E{e])<cJi\^] +^"^ ^ 



n J \ n 



for some constant Cq. Denote by the set of indices in block k. Let 
Uk = maxi^BklEief)}. Then for any j £ Bk 

oJk - E{e]) < C7o(n-(i^'-)(^^i) + n-(i-^)("^^)) < Con-(i-'-){"A/3Ai) 
and hence 

^(^i)= .._wJ.V./oV E m^-i-yL? + E{el_,)+E{el)) 

2i(iBk 



2 


(2- 


l/logn)(n/2)'^ 




2 


(2- 


l/logn)(n/2)^- 




2 



> 



-(2-l/logn)(n/2)-2i^^^ ^ 
2—1/ log n 2 — 1/ log n 
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Since V{x) is bounded away from zero, we know that LOk>C for some con- 
stant C > 0. So when n is sufficiently large, there exists some constant Ci > 
such that 

a;fc Con-(i-^)(°^'3Ai) > Ci/logn. 

2-1/logn 2-1/logn ' 

Since all the moments of Cj exist, all the moments of Aj exist. Then for any 
fixed positive integer I, 

= P{al-E{al)>u,-E{al)) 

~ V V2-l/logn 2-1/logn // 

>P(|a2-i?(a2)|<C7i/logn) 



^ E[{al-E{al)r] 
(Ci/logn)2' 



(Ci / log n)2'n2'"' (2-1/ log n) 



2/ 



i^f 5: (A2-ii;A2)) 

\2ieSfe / 



2Z 



Since A^'s are independent random variables, we know that E{J22teBk 
i?A2))2' is of order (n'')' for sufficiently large n. This means 

(33) P(ai>..) = l-0^(^°^")" 

So 



. fc=l 



n 



(«+i)r-i r 



Since I is an arbitrary positive integer, this means for any m > there exists 
a constant Cm > such that 



\ k iG block k / 



Similarly, we know that for any m > there exists a constant > such 
that 



V k jGblock'fc / 



?1 

□ 



A direct consequence of Lemma 6 is that PiClj^k^'jk — ^'j k — ^'^j k) — 
1 — Cmn~"^ for any ni> and some constant Cm- 



24 T. T. CAI AND L. WANG 

6.2. Upper bound: Proof of Theorem 1. It is clear that the estimators T4, 
Vo, and thus V have the same rate of convergence. Here we will only prove 
the convergence rate result for Ve ■ We shall write V for Ve in the proof. Note 
that 

E\\V-V\\l^=EY^{i,,,-^,,,f 

i=l 

(34) 

Jl oo 

j=jo k j=Ji+l k 

There are a fixed number of terms in the first sum on the RHS of (34). Equa- 
tion (13) and Lemma 4 show that the empirical coefficients dj^^k have vari- 
ance of order and sum of squared biases of order (9(77,-(l^4oA2/3A(l+2/3l-7y))^ 
Note that 7^/ - 1 - 2/3i < so 

sup SE(4o. - Cio.)' = 0(n-(i^^"^2/3A(i+2/5, ^ o(^-i) 

2/3/(1+2/3) N 



/ ^ / J. \ -:^/3/ii+:^/sj\ 



Also, it is easy to see that the third sum on the RHS of (34) is small. Note 
that for 9j k = {V,ipj,k)^ from Lemma 2, 



00 



E E^l.= E E^l.+ E^iJ 

j=Ji+i fc j=Ji+i VfceAj fc^^j / 

00 

< (Ci2^(T^-i-2ft) + C22-2i/') 

// ^ X -2/3/(1+2/3)n 

VVlogn/ 

We now turn to the main term EJ2jLjg J2ki^j,k — Gj,k)'^- Note that 

Jl Jl Jl 

E E Y.(hk - Oj,kf < 2^ E Jlihk - rj,k? + 2^ E - Oj,kf. 

j=jo k j=jo k j=jo k 

The second term is controlled by Lemma 4. We now focus on the first 
term. Note that the thresholds Xj^k are random. We shall denote by -E'|a(") 
the conditional expectation given all the thresholds Xj^k- It follows from 



ADAPTIVE VARIANCE FUNCTION ESTIMATION 



25 



Lemma 1 that 

< E{Tlk A 4A2,) + E{zlj{\zj^k\ > Xj,k)). 

Note that 

E{zj^kH\zj,k\ > Aj- fc)) 

= Eizlkli\zj^k\ > \j,k)I{^lk > <^lk)) 
+ E{zlj,I{\zj^k\ > ^j,k)I{^j,k < ^j,k)) 



< E{zlkl{\zj,k\ > aj,kV2]^)) + EizlkH^ik < 4k)) 
= Si + 52. 

Set pj^k = o"j^fc\/21ogn. Since the moment generating functions of all zf exist 
in a neighborhood of the origin, there exists a constant a > such that 
E{e'^^i'^/"^'^) < oo. Let ^ = Clogn for some constant C > max(l/a, 1), then 

Si = E{zlj{{Aal^ V p]^k) > \zj,k\ > Pj,k)) 

( z^ 



n 



< {A^alk V plk)P{Kk\ > (Tj,kV2log 
2 ^'v4(logn)^ 

i.'^ ga(AV21ogn) ' ' 

Note that, when 2^ < n/(log(n/2))^, each wavelet coefficient at level j is 
a linear combination of m > (log?i)^ of the Ui's. It then follows from Lemma 
3 and (32) that 



fcl > cT,,fey^21og(n/2)) < 0(n-(i-'^)) 

for any a > 0. Also, 

V4(logn)2 ^ C^log^n ^ ^^^log^n 



oa(AV 21ogn) — gaClogn 



n 



Combining these together, and since aj = 0{l/n), we have 

Si<o( 



This means is negligible as compared to the upper bound given in (25). 

It is easy to see that ^2 < {E{z^f,)P{d'j^^ < <7|fc))^'^^- Lemma 6 yields 
-P(<7|fc <<y'jk)~ 0{n~"^) for any m > 0. So 52 is also negligible. 
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We now turn to -E(T|fc A 4A|^). Note that 
(35) E{tI, a 4A2 fc) < 2(r,- - Oj^k? + m^lk A ^\lk)- 

The first part is controlled by Lemma 4. For the second part, 

E{elk A 4A2 ,) < E{elk A (4 X Apl^)) + E{9l,Iial, > 4^,)) 
< 4(91, A 4/,|,,) + 9l„P{al„ > 4al,). 

Note that 6*1^ is bounded (see Lemma 2). From Lemma 6, P{o'j > 4c|fc) = 
0(?i~"^) for any m > 0. So 9jf^P{ajj^ > 4a'j f,) is negligible as compared to 
the upper bound in (25). 

We now turn to 6l| ;. A4p|fc, note that j > ^^^^^ implies 2^ > (l^)^/^^'^+^^ 
and 



A ^pIu < Oh < C2-^^'^m , if , > and k^A,, 

This means 



logn\ . J-log2 J 



77/ 

j<(J-log2 J)/{2/3+l) ^ ' j>(J-log2 J)/(2/3+l) 

+E E c2-j'(i+2/5i) 

< (^^log"^o(J-los,J)/(2/3+l) 

^_ ^2-2/9(.^-log2"f)/(2/3+l) _^ (^2T^-^-2ft 
^^/logn\ 2/3/(1+2/3) 

~ V 

Putting the above bounds together, one can easily see that 



J2 E{{T^,kf A 4A2 J < M|n-4" + 4M2 (n-2 A n'^/^) + C 

-2/3/(1+2/3) 

Jogn 

This proves the global upper bound (25). 



^ ^ -2/3/(1+2/3) 



logn 



< C max I n 
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6.3. Upper bound: Proof of Theorem 2. We now consider the bound given 
in (26) under pointwise MSE. Without loss of generahty, we shall assume 
that / and V are in the Lipschitz classes instead of the local Lipschitz classes, 
that is, we assume / e A"(M/) and V G A^(My). Note that 

E{V{x,)-Vix,)f 
i=l 

Jl ^ \ 2 

+ - dj,k)i^j,kix*) + ^i-.ki'jA^*) ) 

3=30 k j>Jl,k I 

^31 Xl(Oo,i-Oo,i)Vio,fc(3;*) 1 +31 YYl^^hk-Q3,k)^hk{^*)\ 
\%=\ / ^j=jo k / 



\i>Ji k ) 



^I^ + I^ + I^. 

Ii is bounded in the same way as in the global case. Since we are using 
wavelets of compact support, there are at most L basis functions ■0j\fc ^-t 
each resolution level j that are nonvanishing at x^, where L is the length 
of the support of the wavelet ip. Denote K{j,x^:) = {k '-ipjAx^) / 0}. Then 
\K{j,x^:)\ < L. Hence 

\j>JikeK{j,X:,) I \j>Ji / 

= 0(2-^i/^) = o(n-2/3/(i+2/3)). 

We now turn to l2- First, 

V 

'\1pj,ki^*)\ ■ 



\j,k 



Note that 

E{Oj^k — (^j,k)'^ < '^E{9j^k — Tj,k)'^ + '^iTj,k — dj,k 



+ 2E{zl,I{\zj,k\>Xj,k)). 



2 
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This means 

/2<96(E(^i,fcA4p|,fc)^/V,,fc(^*) 



+ 48( ^(rj,fc - 9j±)ipj±{x^) j 
V j,k ) 



+ 24((ii;(4,/(|zj-fc| > A,-fc))V'i,fc(x.))^/')l 

The last two terms fohow from the proof of the global upper bound and the 
second term is controlled by Lemma 5. For the first term, from the discussion 
in the proof of global upper bound, we have 



= E E (^ifcA4p2fc)i/2^,.fc(x,) 

io<i<("f-log2 J)/(2/3+l) kGK{j,x,) 

+ E E iOlk^^plk)'^%,k{x.) 

i>(J-log2 J)/{2/3+l) keK{j,x,) 

jo<j<(J-log2 J)/(2/3+l) ^ " 

+ J2 CL2J'/22-i(/3+i/2) 

i>(J-log2 J)/(2/3+l) 

^, /logn^^/(i+2/3) 



n / 

Putting these together, one can see that h < Cmax(?i~^", i]^)'^^^^^'^'^^^)- 
This proves the local upper bound (26). 

6.4. Lower bound: Proof of Theorem 3. We first outline the main ideas. 
The constrained risk inequality of Brown and Low (1996) implies that if an 
estimator has a small risk at one parameter value and {6i — Oq)^ ^ ep 
where p is the chi-square affinity between the distributions of the data under 
^0 and 6i, then its risk at 6i must be "large." Now the assumption (27) means 
that the estimator V{x^) has a small risk at 9q = Vq{x*). If we can construct 
a sequence of functions Vn such that Vn is "close" to Vq in the sense that 
p is small and at the same time A = |Ki(x*) — Vo(x=i,)| is "large," then it 
follows from the constrained risk inequality that V{x^) must have a "large" 
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risk at 9i = Vn{x^:). So the first step of tlie proof is a construction for such 
a sequence of functions Vn- 

Set Vo = 1 and let 5 be a compactly supported, infinitely differentiable 
function such that g{0) > 0, J g = and / (7^ = 1. Set 

Vn{x) = Vo{x) + T^g{T-^{x - x^)), 
where = (£l2£Z!;) 1/(1+2/3) < c < 1 is a constant. It is easy to check 
that fn are in A^(M) if the constant c is chosen sufficiently small. 

The chi-square affinity can be bounded same as before. Note that the 
chi-square affinity between $ = A^(0, 1) and ^ = A^(0, 1 + 7^) is 

(36) p(cI>,^) = (l-72)-i/2. 

Let yi = V{xi)zi, i = 1, . . . ,n, where Zi are i.i.d. A^(0, 1) variables. Denote 
by Pq and P„ the joint distributions of yi, . . . , y„ under V = Vo and V = Vn, 
respectively. Then it follows from (36) that 

Pn = p{Po,Pn) 



Ui^-r'Jg'ir-Hx^-xM^'^' 
1=1 

-h^^og{l-T^'^g^{T-^{xi-x^))) 



i=l 



< exp <^ r^^ 9^ ('^n ^{xi-x^)) 



i=l 



where the last step follows from the fact — ^ log(l — z) < z for < 2: < ^. Note 
that (nrJ-iEr=iff'(^n'(^*-a;*))^// = l, so T.i=i9\r-\xi - x,)) < 
2nr„ for sufficiently large n, and hence p„ < exp(2nT^^^^) < n^'^. Since the 
zero function is in A"o(Mj) and Vq S A^''(My), equation (27) implies that 
for some constant C > 0, 

E{V{x,) - VQ{x,)f < C7n-2/3i/(i+2/3i)j^-r- 

for r = min{4ao, ~ 1+2)31 ^ ^" Hence, for sufficiently large n, it follows 

from the constrained risk inequality of Brown and Low (1996) that 

S(1>(x.)-K(x.))^>^„^V(0)fl-^" " " 



T^ff(O) 

clogn \W(i+2/3) 
n J 



^ ~ (clogn/n)/3i/(i+2A)5(0) j 
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2/3i/(l+2/3i) 



by choosing the constant c < r/2. 
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